Using Multiple Metrics in Automatically Building Turkish Paraphrase Corpus
Abstract
Paraphrasing is expressing similar meanings with different words, possibly in a different order. In this sense it can be viewed as translation within the same language. It is an important topic in natural language processing, with applications in machine translation, question answering, text summarization and language generation. Studies in paraphrasing can be classified as paraphrase extraction, paraphrase generation and paraphrase recognition. In this paper we present automatic sentential paraphrase extraction from comparable texts: articles on similar news downloaded from Turkish newspapers. We applied seven text similarity metrics and treated the two most similar ones as paraphrase candidates. Through an interface, these are shown to three human annotators to be labelled as paraphrase, entailing, entailed, opposite in meaning or not paraphrase. Here we present only results derived from a single topic. The sentences in the other topics will be processed based on the experience gained in the current work. This will be the first automatically built Turkish paraphrase corpus with gold-standard tags.
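The candidate-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the seven metrics or say how their scores are combined, so everything here is an assumption: Jaccard, Dice and cosine similarity over whitespace tokens stand in for the real metrics, their scores are simply averaged, and the function name `top_candidates` is hypothetical. The example sentences are in English for readability.

```python
from collections import Counter
from itertools import product
from math import sqrt


def tokens(sentence):
    # Naive lowercase whitespace tokenization (assumption); real Turkish
    # text would likely need proper tokenization and morphological analysis.
    return sentence.lower().split()


def jaccard(a, b):
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def dice(a, b):
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


# Illustrative stand-ins only; these are NOT the paper's seven metrics.
METRICS = (jaccard, dice, cosine)


def top_candidates(source_a, source_b, k=2):
    """Return the k highest-scoring cross-source sentence pairs.

    Each pair is scored by the unweighted mean of all metrics
    (an assumed combination rule, not taken from the paper).
    """
    scored = []
    for s1, s2 in product(source_a, source_b):
        t1, t2 = tokens(s1), tokens(s2)
        score = sum(metric(t1, t2) for metric in METRICS) / len(METRICS)
        scored.append((score, s1, s2))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    paper_a = ["the minister visited ankara today", "prices rose sharply"]
    paper_b = ["today the minister visited ankara", "the weather was cold", "prices rose"]
    for score, s1, s2 in top_candidates(paper_a, paper_b):
        print(f"{score:.3f}  {s1!r}  <->  {s2!r}")
```

In a pipeline like the one the abstract describes, the selected pairs would then be passed to human annotators for labelling; bag-of-words metrics such as these ignore word order, which is why reordered sentences score as highly similar.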
Related Resources
A Case Study Towards Turkish Paraphrase Alignment
Paraphrasing is expressing the same semantic content using different linguistic means. Although previous work has addressed linguistic variations at different levels of language, paraphrasing in Turkish has not yet been thoroughly studied. This paper presents the first study towards Turkish paraphrase alignment. We perform an analysis of different types of paraphrases on a modest Turkish paraph...
Leveraging Paraphrase Labels to Extract Synonyms from Twitter
We present an approach for automatically learning synonyms from a corpus of paraphrased tweets. The synonyms are learned by using shallow parse chunks to create candidate synonyms and their context windows, and the synonyms are substituted back into a paraphrase detection system that uses machine translation metrics as features for a classifier. We find a 2.29% improvement in F1 when we train a...
Turkish Paraphrase Corpus
Paraphrases are alternative syntactic forms in the same language expressing the same semantic content. Speakers of all languages are inherently familiar with paraphrases at different levels of granularity (lexical, phrasal, and sentential). For quite some time, the concept of paraphrasing has been receiving growing attention from the research community and its potential use in several natural language ...
Re-examining Machine Translation Metrics for Paraphrase Identification
We propose to re-examine the hypothesis that automated metrics developed for MT evaluation can prove useful for paraphrase identification in light of the significant work on the development of new MT metrics over the last 4 years. We show that a meta-classifier trained using nothing but recent MT metrics outperforms all previous paraphrase identification approaches on the Microsoft Research Par...
Building a Non-Trivial Paraphrase Corpus Using Multiple Machine Translation Systems
We propose a novel sentential paraphrase acquisition method. To build a well-balanced corpus for Paraphrase Identification, we especially focus on acquiring both non-trivial positive and negative instances. We use multiple machine translation systems to generate positive candidates and a monolingual corpus to extract negative candidates. To collect non-trivial instances, the candidates are unifor...
Journal title:
- Research in Computing Science
Volume: 117
Publication date: 2016